SwePub
Tyck till om SwePub Sök här!
Sök i SwePub databas

  Utökad sökning

Träfflista för sökning "db:Swepub ;pers:(Lu Zhonghai);pers:(Fu Yuxiang)"

Sökning: db:Swepub > Lu Zhonghai > Fu Yuxiang

  • Resultat 1-10 av 14
Sortera/gruppera träfflistan
   
NumreringReferensOmslagsbildHitta
1.
  • Chen, Hui, et al. (författare)
  • A CORDIC-Based Architecture with Adjustable Precision and Flexible Scalability to Implement Sigmoid and Tanh Functions
  • 2020
  • Ingår i: IEEE International Symposium on Circuits and Systems, ISCAS 2020. - : IEEE.
  • Konferensbidrag (refereegranskat)abstract
    • In the artificial neural networks, tanh (hyperbolic tangent) and sigmoid functions are widely used as activation functions. Past methods to compute them may have shortcomings such as low precision or inflexible architecture that is difficult to expand, so we propose a CORDIC-based architecture to implement sigmoid and tanh functions, which has adjustable precision and flexible scalability. It just needs shift-add-or-subtract operations to compute high-accuracy results and is easy to expand the input range through scaling the negative iterations of CORDIC without changing the original architecture. We adopt the control variable method to explore the accuracy distribution through software simulation. A specific case (ARCH:(1, 15, 18), RMSE: 10(-6)) is designed and synthesized under the TSMC 40nm CMOS technology, the report shows that it has the area of 36512.78 mu m(2) and power of 12.35mW at the frequency of 1GHz. The maximum work frequency can reach 1.5GHz, which is better than the state-of-the-art methods.
  •  
2.
  • Chen, Hui, et al. (författare)
  • A General Methodology and Architecture for Arbitrary Complex Number Nth Root Computation
  • 2021
  • Ingår i: 2021 SCAS 2021/IEEE International Symposium on Circuits and Systems. - : Institute of Electrical and Electronics Engineers (IEEE).
  • Konferensbidrag (refereegranskat)abstract
    • As the existing complex number Nth root computation methods are relatively discrete, we propose a general method and architecture based on coordinate rotation digital computer (CORDIC) to compute arbitrary complex number Nth root for the first time. Our method performs the tasks of computing complex modulus, complex phase angle, real Nth root, sine function and cosine function, which can be implemented by circular CORDIC, linear CORDIC and hyperbolic CORDIC. Based on these CORDICs, our proposed architecture can not only improve the hardware efficiency just through shift-add operations, but also flexibly adjust the precision and the input range of complex number Nth root. To prove its feasibility, we conduct a software simulation and implement an example circuit in hardware. Under the TSMC 28nm CMOS technology, we synthesize it and get the report that it has the area of 6561 mu m(2) and the power of 3.95mW at the frequency of 1.5GHz.
  •  
3.
  • Chen, Hui, et al. (författare)
  • An Efficient Hardware Architecture with Adjustable Precision and Extensible Range to Implement Sigmoid and Tanh Functions
  • 2020
  • Ingår i: Electronics. - : MDPI. - 2079-9292. ; 9:10
  • Tidskriftsartikel (refereegranskat)abstract
    • The efficient and precise hardware implementations of tanh and sigmoid functions play an important role in various neural network algorithms. Different applications have different requirements for accuracy. However, it is difficult for traditional methods to achieve adjustable precision. Therefore, we propose an efficient-hardware, adjustable-precision and high-speed architecture to implement them for the first time. Firstly, we present two methods to implement sigmoid and tanh functions. One is based on the rotation mode of hyperbolic CORDIC and the vector mode of linear CORDIC (called RHC-VLC), another is based on the carry-save method and the vector mode of linear CORDIC (called CSM-VLC). We validate the two methods by MATLAB and RTL implementations. Synthesized under the TSMC 40 nm CMOS technology, we find that a special case AR divide VR(3,0), based on RHC-VLC method, has the area of 4290.98 mu m2 and the power of 1.69 mW at the frequency of 1.5 GHz. However, under the same frequency, AR divide VC(3) (a special case based on CSM-VLC method) costs 3196.36 mu m2 area and 1.38 mW power. They are both superior to existing methods for implementing such an architecture with adjustable precision.
  •  
4.
  • Chen, Hui, et al. (författare)
  • Hyperbolic CORDIC-Based Architecture for Computing Logarithm and Its Implementation
  • 2020
  • Ingår i: IEEE Transactions on Circuits and Systems - II - Express Briefs. - : Institute of Electrical and Electronics Engineers (IEEE). - 1549-7747 .- 1558-3791. ; 67:11, s. 2652-2656
  • Tidskriftsartikel (refereegranskat)abstract
    • We present a CORDIC (Coordinate Rotation Digital Computer)-based method to compute the logarithm function with base 2 and validate this method by software simulation and hardware implementation. Technically, we overcome the limitation of traditional hyperbolic CORDIC and transform it based on the idea of generalized hyperbolic CORDIC so that it can be used to compute $log_{2}x\;(x\;\epsilon \;[1,2))$ . The proposed method requires only simple shift-and-add operations and has a great tradeoff between precision (or speed) and area. In MATLAB, we provide different precisions corresponding to the iterations of the transformed CORDIC for user needs. Using a pipelined structure and setting the number of iterations to be 16 (the average relative error is $2.09\times 10<^>{-6}$ ), we implement an example hardware circuit. Synthesized under the SMIC 65nm CMOS technology, the circuit has an area of 24100 $\mu m<^>{2}$ and computation time of 11.1 ns, which can save 31.04x0025; area and improve 6.92x0025; computation speed averagely compared with existing methods.
  •  
5.
  • Chen, Hui, et al. (författare)
  • Low-Complexity High-Precision Method and Architecture for Computing the Logarithm of Complex Numbers
  • 2021
  • Ingår i: IEEE Transactions on Circuits and Systems Part 1. - : Institute of Electrical and Electronics Engineers (IEEE). - 1549-8328 .- 1558-0806. ; 68:8, s. 3293-3304
  • Tidskriftsartikel (refereegranskat)abstract
    • This paper proposes a low-complexity method and architecture to compute the logarithm of complex numbers based on coordinate rotation digital computer (CORDIC). Our method takes advantage of the vector mode of circular CORDIC and hyperbolic CORDIC, which only needs shift-add operations in its hardware implementation. Our architecture has lower design complexity and higher performance compared with conventional architectures. Through software simulation, we show that this method can achieve high precision for logarithm computation, reaching the relative error of 10(-7). Finally, we design and implement an example circuit under TSMC 28nm CMOS technology. According to the synthesis report, our architecture has smaller area, lower power consumption, higher precision and wider operation range compared with the alternative architectures.
  •  
6.
  • Chen, Hui, et al. (författare)
  • Symmetric-Mapping LUT-Based Method and Architecture for Computing X-Y-Like Functions
  • 2021
  • Ingår i: IEEE Transactions on Circuits and Systems Part 1. - : IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. - 1549-8328 .- 1558-0806. ; 68:3, s. 1231-1244
  • Tidskriftsartikel (refereegranskat)abstract
    • We propose a new method and hardware architecture to compute the functions expressed as XY ( X and Y are arbitrary floating-point numbers), which can support arbitrary Nth root, exponential and power operations. Because of the complexity of direct computation, we usually convert it to logarithm, multiplication, and antilogarithm operations. Traditional approaches suffer from long latency, large area and high power consumption. To solve this problem, we propose a symmetric-mapping lookup table (SM-LUT) to be capable of computing log(2) x (x is an element of [1, 2]) and 2 x (x is an element of [0, 1]) simultaneously. It lays the foundation for computing XY. To further improve hardware performance of our architecture, we propose a multi-region address searcher to speed up the calculation of SM-LUT. In addition, we use an optimized Vedic multiplier to shorten the critical path and improve the efficiency of multiplication, which is included in computing X-Y. Under the TSMC 40nm CMOS technology, we design and synthesize a reference circuit to compute X-Y with a maximum relative error of 10(-3). The report shows that the reference circuit achieves the area of 14338.50 mu m(2) and the power consumption of 4.59 mW at the frequency of 1 GHz. In comparison with the state-of-the-art work under the same input range and similar precision, it saves 78.57% area and 80.42% power consumption for (N)root R computation and 82.89% area and 81.89% power consumption for R-N computation averagely. On top of that, our architecture reduces the computation latency by 62.77% averagely and has one more order of magnitude of energy efficiency than others.
  •  
7.
  • Chen, Qinyu, et al. (författare)
  • An Efficient Accelerator for Multiple Convolutions From the Sparsity Perspective
  • 2020
  • Ingår i: IEEE Transactions on Very Large Scale Integration (vlsi) Systems. - : Institute of Electrical and Electronics Engineers (IEEE). - 1063-8210 .- 1557-9999. ; 28:6, s. 1540-1544
  • Tidskriftsartikel (refereegranskat)abstract
    • Convolutional neural networks (CNNs) have emerged as one of the most popular ways applied in many fields. These networks deliver better performance when going deeper and larger. However, the complicated computation and huge storage impede hardware implementation. To address the problem, quantized networks are proposed. Besides, various convolutional structures are designed to meet the requirements of different applications. For example, compared with the traditional convolutions (CONVs) for image classification, CONVs for image generation are usually composed of traditional CONVs, dilated CONVs, and transposed CONVs, leading to a difficult hardware mapping problem. In this brief, we translate the difficult mapping problem into the sparsity problem and propose an efficient hardware architecture for sparse binary and ternary CNNs by exploiting the sparsity and low bit-width characteristics. To this end, we propose an ineffectual data removing (IDR) mechanism to remove both the regular and irregular sparsity based on dual-channel processing elements (PEs). Besides, a flexible layered load balance (LLB) mechanism is introduced to alleviate the load imbalance. The accelerator is implemented with 65-nm technology with a core size of 2.56 mm(2). It can achieve 3.72-TOPS/W energy efficiency at 50.1 mW, which makes it a promising design for embedded devices.
  •  
8.
  • Chen, Qinyu, et al. (författare)
  • An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks
  • 2019
  • Ingår i: Electronics. - : MDPI. - 2079-9292. ; 8:4
  • Tidskriftsartikel (refereegranskat)abstract
    • Convolutional Neural Networks (CNNs) have been widely applied in various fields, such as image recognition, speech processing, as well as in many big-data analysis tasks. However, their large size and intensive computation hinder their deployment in hardware, especially on the embedded systems with stringent latency, power, and area requirements. To address this issue, low bit-width CNNs are proposed as a highly competitive candidate. In this paper, we propose an efficient, scalable accelerator for low bit-width CNNs based on a parallel streaming architecture. With a novel coarse grain task partitioning (CGTP) strategy, the proposed accelerator with heterogeneous computing units, supporting multi-pattern dataflows, can nearly double the throughput for various CNN models on average. Besides, a hardware-friendly algorithm is proposed to simplify the activation and quantification process, which can reduce the power dissipation and area overhead. Based on the optimized algorithm, an efficient reconfigurable three-stage activation-quantification-pooling (AQP) unit with the low power staged blocking strategy is developed, which can process activation, quantification, and max-pooling operations simultaneously. Moreover, an interleaving memory scheduling scheme is proposed to well support the streaming architecture. The accelerator is implemented with TSMC 40 nm technology with a core size of . It can achieve TOPS/W energy efficiency and area efficiency at 100.1mW, which makes it a promising design for the embedded devices.
  •  
9.
  • Chen, Qinyu, et al. (författare)
  • Smilodon : An Efficient Accelerator for Low Bit-Width CNNs with Task Partitioning
  • 2019
  • Ingår i: 2019 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS). - : IEEE. - 9781728103976
  • Konferensbidrag (refereegranskat)abstract
    • Convolutional Neural Networks (CNNs) have been widely applied in various fields such as image and video recognition, recommender systems, and natural language processing. However, the massive size and intensive computation loads prevent its feasible deployment in practice, especially on the embedded systems. As a highly competitive candidate, low bit-width CNNs are proposed to enable efficient implementation. In this paper, we propose Smilodon, a scalable, efficient accelerator for low bit-width CNNs based on a parallel streaming architecture, optimized with a task partitioning strategy. We also present the 3D systolic-like computing arrays fitting for convolutional layers. Our design is implemented on Zynq XC7ZO20 FPGA, which can satisfy the needs of real-time with a frame rate of 1, 622 FPS throughput, while consuming 2.1 Watt. To the best of our knowledge, our accelerator is superior to the state-of-the-art works in the tradeoff among throughput, power efficiency, and area efficiency.
  •  
10.
  • Fu, Yuxiang, et al. (författare)
  • Congestion-Aware Dynamic Elevator Assignment for Partially Connected 3D-NoCs
  • 2019
  • Ingår i: 2019 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS). - : IEEE. - 9781728103976
  • Konferensbidrag (refereegranskat)abstract
    • The combination of Network-on-Chips (NoCs) and 3D IC technology, 3D NoCs, has been proven to be able to achieve a great improvement in both network performance and power consumption compared to 2D NoCs. In the traditional 3D NoC, all routers are vertically connected. Due to the large overhead of Through-Silicon-Via (TSV, e.g., low fabrication yield and the occupied silicon area), the partially connected 3D NoC has emerged. The assignment method determines the traffic loads of the vertical links (elevators), thus has a great impact on 3D-NoCs' performance. In this paper, we propose a congestion-aware dynamic elevator assignment (CDA) scheme, which takes both the distance factors and network congestion information into account. Experiments show that the performance of the proposed CDA scheme is improved by 67% to 87% compared to the random selection scheme, 8% to 25% compared to SelByDis-1, and 13% to 18% compared to SelByDis-2.
  •  
Skapa referenser, mejla, bekava och länka
  • Resultat 1-10 av 14

Kungliga biblioteket hanterar dina personuppgifter i enlighet med EU:s dataskyddsförordning (2018), GDPR. Läs mer om hur det funkar här.
Så här hanterar KB dina uppgifter vid användning av denna tjänst.

 
pil uppåt Stäng

Kopiera och spara länken för att återkomma till aktuell vy